Today’s content has been drawn from:
Roger Peng’s book: R Programming for data science, Ch 22
Norman Matloff’s book: The Art of R Programming, Ch 16
Loosely, from R’s perspective, we can think of HPC in terms of two, maybe three things:
Big data: How to work with data that doesn’t fit on your computer
Parallel computing: How to take advantage of multiple core systems
Compiled code: Write your own low-level code (if R doesn’t have it yet…)
(Check out the CRAN Task View on HPC)
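As a teaser for the compiled-code route, here is a minimal sketch using Rcpp’s cppFunction (it assumes Rcpp and a C++ toolchain are available; sum2 is a made-up example name):

```r
# Compile a tiny C++ function into the current R session on the fly.
# If Rcpp or a compiler is missing, fall back to NULL instead of erroring.
res <- tryCatch({
  Rcpp::cppFunction("
    double sum2(NumericVector x) {
      double total = 0;
      for (int i = 0; i < x.size(); ++i) total += x[i];
      return total;
    }")
  sum2(c(1, 2, 3))
}, error = function(e) NULL)
res
```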
In raw terms
Supercomputer: A single big machine with thousands of cores/gpus.
High Performance Computing (HPC): Multiple machines within a single network.
High Throughput Computing (HTC): Multiple machines across multiple networks.
You may not have access to a supercomputer, but HPC/HTC clusters are certainly more accessible these days; e.g., AWS provides a service to create HPC clusters at low cost.
Now, how many cores does your computer have? The parallel package can tell you:

parallel::detectCores()
[1] 10
f <- function(n) n*2
f(1:4)

Let’s think before we start…
“In almost any parallel-processing application, you encounter overhead, or “wasted” time spent on noncomputational activity.” Matloff, ch 16, pg 337. See 16.4 for sources of overhead.
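A quick way to see that overhead (a sketch using base R’s parallel package; exact timings vary by machine):

```r
library(parallel)

cl <- makeCluster(2L)  # two worker processes

# A trivially cheap task: the cost of shipping data to the workers
# and collecting results dwarfs the computation itself, so the
# parallel version is often *slower* than the serial one.
t_serial   <- system.time(lapply(1:1e4, sqrt))["elapsed"]
t_parallel <- system.time(parLapply(cl, 1:1e4, sqrt))["elapsed"]

stopCluster(cl)
c(serial = t_serial, parallel = t_parallel)
```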
While there are several alternatives (just take a look at the High-Performance Computing Task View), we’ll focus on the following R-packages for explicit parallelism.
- parallel: R package that provides ‘[s]upport for parallel computation, including random-number generation’.
- foreach: R package for ‘general iteration over elements’ in parallel fashion.
- future: ‘[A] lightweight and unified Future API for sequential and parallel processing of R expression via futures.’ (won’t cover here)
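For a taste of foreach, a minimal sketch (assuming the foreach and doParallel packages are installed; skipped otherwise):

```r
# %dopar% dispatches each iteration to a worker registered via
# doParallel; .combine = c collects the results into a vector.
res <- NULL
if (requireNamespace("foreach", quietly = TRUE) &&
    requireNamespace("doParallel", quietly = TRUE)) {
  library(foreach)
  cl <- parallel::makeCluster(2L)
  doParallel::registerDoParallel(cl)
  res <- foreach(i = 1:4, .combine = c) %dopar% i^2
  parallel::stopCluster(cl)
}
res
```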
Implicit (or hidden) parallelism, on the other hand, refers to out-of-the-box tools that allow the programmer not to worry about parallelization, e.g. gpuR for matrix manipulation using the GPU, tensorflow, and data.table
And there’s also a more advanced set of options
- A ton of other types of resources, notably tools for working with batch schedulers such as Slurm, HTCondor, etc.
Based on the snow (simple network of workstations) and multicore R Packages.
Explicit parallelism.
Simple yet powerful idea: Parallel computing as multiple R sessions.
Clusters can be made of both local and remote sessions
Multiple types of cluster: PSOCK, Fork, MPI, etc.
“For the fork, each parallel thread is a complete duplication of the master process with the shared environment, including objects or variables defined prior to the kickoff of parallel threads. Therefore, it runs fast. However, the major limitation is that the fork doesn’t work on the Windows system.”
“On the other hand, the socket works on all operating systems. Each thread runs separately without sharing objects or variables, which can only be passed from the master process explicitly. As a result, it runs slower due to the communication overhead.”
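The difference is easy to see in a sketch (the FORK branch is skipped on Windows, where forking is unavailable):

```r
library(parallel)

x <- 42

# PSOCK: workers start as fresh R sessions, so they don't see x
cl <- makeCluster(2L, type = "PSOCK")
before <- tryCatch(clusterEvalQ(cl, x), error = function(e) "not found")
clusterExport(cl, "x")                 # explicit copy to each worker
after  <- unlist(clusterEvalQ(cl, x))  # now every worker sees 42
stopCluster(cl)

# FORK: workers are copies of the master, so x is inherited for free
if (.Platform$OS.type != "windows") {
  cl <- makeCluster(2L, type = "FORK")
  inherited <- unlist(clusterEvalQ(cl, x))  # 42 without any export
  stopCluster(cl)
}
```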
(Usually) We do the following:
Create a PSOCK/FORK (or other) cluster using makePSOCKcluster/makeForkCluster (or makeCluster)
Copy/prepare each R session (if you are using a PSOCK cluster):
Copy objects with parallel::clusterExport
Pass expressions with parallel::clusterEvalQ
Set a seed
Do your call: parallel::parApply, parallel::parLapply, etc.
Stop the cluster with parallel::stopCluster
# 1. CREATING A CLUSTER
library(parallel)
cores <- parallel::detectCores()
cl <- parallel::makePSOCKcluster(cores)
x <- 20
# 2. PREPARING THE CLUSTER
parallel::clusterSetRNGStream(cl, 123) # Sets reproducible RNG streams on the workers (the parallel analogue of `set.seed(123)`)
parallel::clusterExport(cl, "x")
# 3. DO YOUR CALL
parallel::clusterEvalQ(cl, {
paste0("Hello from process #", Sys.getpid(), ". I see x and it is equal to ", x)
})
[[1]]
[1] "Hello from process #97409. I see x and it is equal to 20"
[[2]]
[1] "Hello from process #97412. I see x and it is equal to 20"
[[3]]
[1] "Hello from process #97408. I see x and it is equal to 20"
[[4]]
[1] "Hello from process #97407. I see x and it is equal to 20"
[[5]]
[1] "Hello from process #97405. I see x and it is equal to 20"
[[6]]
[1] "Hello from process #97404. I see x and it is equal to 20"
[[7]]
[1] "Hello from process #97411. I see x and it is equal to 20"
[[8]]
[1] "Hello from process #97410. I see x and it is equal to 20"
[[9]]
[1] "Hello from process #97406. I see x and it is equal to 20"
[[10]]
[1] "Hello from process #97413. I see x and it is equal to 20"
# 4. STOP THE CLUSTER
parallel::stopCluster(cl)

Problem: Run multiple regressions for each column of a very wide dataset. Report the \(\beta\) coefficients from fitting the following regression models:
\[ y = X_{i}\beta_i + \varepsilon_{i},\quad \varepsilon_{i} \sim N(0, \sigma^2_i) \]
\[ \quad\forall \text{ columns } i \in \left\{1, \dots, 999 \right\} \]
set.seed(131)
y <- rnorm(500)
X <- matrix(rnorm(500*999), nrow = 500, dimnames = list(1:500, sprintf("x%03d", 1:999)))
dim(X)
[1] 500 999
X[1:6, 1:5]
         x001        x002       x003       x004      x005
1 0.61827227 1.72847041 -1.4810695 -0.2471871 1.4776281
2 0.96777456 -0.19358426 -0.8176465 0.6356714 0.7292221
3 -0.04303734 -0.06692844 0.9048826 -1.9277964 2.2947675
4 0.84237608 -1.13685605 -1.8559158 0.4687967 0.9881953
5 -1.91921443 1.83865873 0.5937039 -0.1410556 0.6507415
6 0.59146153 0.81743419 0.3348553 -1.8771819 0.8181764
str(y)
 num [1:500] -0.8188 -0.5438 1.0209 0.0467 -0.4501 ...
Serial solution:
Use apply (a loop over the columns) to solve it:
ans <- apply(
X = X,
MARGIN = 2,
FUN = function(x, y) coef(lm(y ~ x)),
y = y
)
ans[,1:5]
                   x001        x002        x003        x004        x005
(Intercept) -0.03449819 -0.03339681 -0.03728140 -0.03644192 -0.03717344
x -0.06082548 0.02748265 -0.01327855 -0.08012361 -0.04067826
Parallel solution: Use parApply
library(parallel)
cl <- makePSOCKcluster(4L)
ans <- parApply(
cl = cl,
X = X,
MARGIN = 2,
FUN = function(x, y) coef(lm(y ~ x)),
y = y
)
ans[,1:5]
                   x001        x002        x003        x004         x005
(Intercept) -0.02094065 -0.01630664 -0.01565541 -0.01848773 -0.015305816
x 0.09269758 -0.05233096 0.02893588 0.02404687 0.009151671
Are we going any faster? Compare the timing using the bench package.
library(bench)
mark(
parallel = parApply(
cl = cl,
X = X, MARGIN = 2,
FUN = function(x, y) coef(lm(y ~ x)),
y = y
),
serial = apply(
X = X, MARGIN = 2,
FUN = function(x, y) coef(lm(y ~ x)),
y = y
)
)
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 parallel 74.9ms 78.4ms 12.5 11.8MB 5.36
2 serial 228.4ms 229.9ms 4.31 98.8MB 41.7
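On Unix-alikes there is also a fork-based shortcut, parallel::mclapply, which skips the explicit cluster setup entirely (a sketch on a smaller, made-up matrix; on Windows it falls back to a single core):

```r
library(parallel)

set.seed(131)
y <- rnorm(100)
X <- matrix(rnorm(100 * 50), nrow = 100)

# Forked workers inherit X and y, so no clusterExport is needed.
# mclapply with mc.cores > 1 is not supported on Windows, hence the guard.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
ans <- do.call(cbind, mclapply(
  seq_len(ncol(X)),
  function(i) coef(lm(y ~ X[, i])),
  mc.cores = cores
))
dim(ans)  # 2 coefficients x 50 columns
```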
There may be smarter ways to do this; this is how I approached it in the SeqSGPV package
For more, checkout the CRAN Task View on HPC
We know that \(\pi = \frac{A}{r^2}\). We approximate it by uniformly sampling points \(x\) in a square of side 2 centered at the origin.
So, we approximate \(\pi\) as \(\Pr\{\|x\| \leq 1\}\times 2^2\)
The R code to do this
pisim <- function(i, nsim) { # Notice we don't use the -i-
# Random points
ans <- matrix(runif(nsim*2), ncol=2)
# Distance to the origin
ans <- sqrt(rowSums(ans^2))
# Estimated pi
(sum(ans <= 1)*4)/nsim
}

library(parallel)
# Setup
cl <- makePSOCKcluster(4L)
clusterSetRNGStream(cl, 123)
# Number of simulations we want each time to run
nsim <- 1e5
# We need to make -nsim- and -pisim- available to the
# cluster
clusterExport(cl, c("nsim", "pisim"))
# Benchmarking: parSapply and sapply will run this simulation
# a hundred times each, so at the end we have 1e5*100 points
# to approximate pi
rbenchmark::benchmark(
parallel = parSapply(cl, 1:100, pisim, nsim=nsim),
serial = sapply(1:100, pisim, nsim=nsim), replications = 1
)[,1:4]
      test replications elapsed relative
1 parallel 1 0.085 1.000
2 serial 1 0.277 3.259
Friendlier than RcppParallel… at least for ‘I-use-Rcpp-but-don’t-actually-know-much-about-C++’ users (like myself!).
Must run only ‘thread-safe’ calls, so calling R within parallel blocks can cause problems (almost all the time).
Use arma objects, e.g. arma::mat, arma::vec, etc., or, if you are used to them, std::vector objects, as these are thread-safe.
Pseudo-random number generation is not very straightforward… but C++11 has a nice set of functions that can be used together with OpenMP.
Need to think about how processors work, cache memory, etc., otherwise you could get into trouble… if your code is slower when run in parallel, you are probably facing false sharing.
If R crashes… try running R with a debugger (see Section 4.3 in Writing R extensions):
~$ R --debugger=valgrind

devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.2.1 (2022-06-23)
os macOS Monterey 12.6
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Denver
date 2023-02-14
pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.2.0)
callr 3.7.2 2022-08-22 [1] CRAN (R 4.2.0)
cli 3.4.1 2022-09-23 [1] CRAN (R 4.2.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.0)
devtools 2.4.5 2022-10-11 [1] CRAN (R 4.2.0)
digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
evaluate 0.16 2022-08-09 [1] CRAN (R 4.2.0)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.2.0)
htmlwidgets 1.5.4 2021-09-08 [1] CRAN (R 4.2.0)
httpuv 1.6.6 2022-09-08 [1] CRAN (R 4.2.0)
jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.2.0)
knitr 1.40 2022-08-24 [1] CRAN (R 4.2.0)
later 1.3.0 2021-08-18 [1] CRAN (R 4.2.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.0)
mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.2.0)
pkgbuild 1.3.1 2021-12-20 [1] CRAN (R 4.2.0)
pkgload 1.3.0 2022-06-27 [1] CRAN (R 4.2.0)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.2.0)
processx 3.7.0 2022-07-07 [1] CRAN (R 4.2.0)
profvis 0.3.7 2020-11-02 [1] CRAN (R 4.2.0)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.0)
ps 1.7.1 2022-06-18 [1] CRAN (R 4.2.0)
purrr 0.3.5 2022-10-06 [1] CRAN (R 4.2.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
Rcpp 1.0.10 2023-01-22 [1] CRAN (R 4.2.0)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.2.0)
rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)
rmarkdown 2.17 2022-10-07 [1] CRAN (R 4.2.0)
rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
shiny 1.7.2 2022-07-19 [1] CRAN (R 4.2.0)
stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.0)
stringr 1.4.1 2022-08-20 [1] CRAN (R 4.2.0)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.2.0)
usethis 2.1.6 2022-05-25 [1] CRAN (R 4.2.0)
xfun 0.33 2022-09-12 [1] CRAN (R 4.2.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.0)
yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0)
[1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
──────────────────────────────────────────────────────────────────────────────